
Computing Time-Windowed Features in Cloud Dataprep


2 hours Free

Overview

In this lab you will ingest, transform, and analyze a taxi cab dataset using Google Cloud Dataprep. You will calculate key reporting metrics like the average number of passengers picked up in the past hour.

What you learn

In this lab, you:

  • Build a new Flow using Cloud Dataprep
  • Create and chain transformation steps with recipes
  • Run Cloud Dataprep jobs (Dataflow runs behind the scenes)

Cloud Dataprep is Google's self-service data preparation tool. In this lab, you will learn how to clean and enrich multiple datasets using Cloud Dataprep.

Setup

For each lab, you get a new Google Cloud project and set of resources for a fixed time at no cost.

  1. Make sure you are signed in to Qwiklabs using an incognito window.

  2. Note the lab's access time (for example, img/time.png) and make sure you can finish in that time block.

  3. When ready, click img/start_lab.png.

  4. Note your lab credentials. You will use them to sign in to the Google Cloud Console. img/open_google_console.png

  5. Click Open Google Console.

  6. Click Use another account and copy/paste the credentials for this lab into the prompts.

  7. Accept the terms and skip the recovery resource page.

Check project permissions

Before you begin your work on Google Cloud, you need to ensure that your project has the correct permissions within Identity and Access Management (IAM).

  1. In the Google Cloud console, on the Navigation menu (nav-menu.png), click IAM & Admin > IAM.

  2. Confirm that the default compute Service Account {project-number}-compute@developer.gserviceaccount.com is present and has the editor role assigned. The account prefix is the project number, which you can find on Navigation menu > Home.

check-sa.png

If the account is not present in IAM or does not have the editor role, follow the steps below to assign the required role.

  • In the Google Cloud console, on the Navigation menu, click Home.

  • Copy the project number (e.g. 729328892908).

  • On the Navigation menu, click IAM & Admin > IAM.

  • At the top of the IAM page, click Add.

  • For New members, type:

{project-number}-compute@developer.gserviceaccount.com

Replace {project-number} with your project number.

  • For Role, select Project (or Basic) > Editor. Click Save.

add-sa.png

Create a new Storage Bucket

Skip this section if you already have a GCS Bucket

Step 1

Open the Google Cloud Console at console.cloud.google.com.

Step 2

Go to Storage in the Navigation menu (left-side navigation).

Step 3

Click Create Bucket (or use an existing bucket).

Step 4

In the Create a bucket window that appears, enter a unique bucket name, such as your Qwiklabs Project ID, and leave the remaining settings at their default values.

Step 5

Click Create.

Step 6

You now have a Cloud Storage Bucket which you will use to store raw data for ingestion into Google BigQuery later and for storing Cloud Dataprep settings.

Create BigQuery Dataset to store Cloud Dataprep Output

Step 1

Open BigQuery: In the Google Cloud Console, select Navigation menu > BigQuery:

nav-bq.png

The Welcome to BigQuery in the Cloud Console message box opens, providing a link to the quickstart guide and describing UI updates.

Click Done.

Step 2

In the left side bar, click on your project name.

Step 3

Click CREATE DATASET.

Step 4

For Dataset ID, type taxi_cab_reporting and select Create dataset.

Now you have a new empty dataset that you can populate with tables.

Launch Cloud Dataprep

Step 1

Open the Navigation menu.

Step 2

Under Big Data, click on Dataprep.

Step 3

Accept the Terms of Service.

Step 4

Click Agree and Continue.

5eabff1419ebfaea.png

Click Allow for Trifacta to access project data. Dataprep is provided in collaboration with Trifacta, a Google partner. 59ddf08c1bbcf24b.png

Step 5

When the Sign in with Google window appears, select your Qwiklabs account and then click Allow. Accept the Trifacta Terms of Service if prompted.

18fc12676fc080ed.png

Step 6

When prompted for First time set up, click Continue.

Step 7

Wait for Cloud Dataprep to initialize (typically less than one minute).

Import NYC Taxi Data from GCS into a Dataprep Flow

Step 1

In the Cloud Dataprep UI, click Create Flow on the top right side of the page.

Step 2

Specify the following Flow details:

  • Flow Name: NYC Taxi Cab Data Reporting
  • Flow Description: Ingesting, Transforming, and Analyzing Taxi Data

Click Create.

If prompted, dismiss the helper tutorial (click Don't show me any helpers).

Step 3

Click Import & Add Datasets.

Step 4

In the data importer left side menu, click GCS (Google Cloud Storage).

Step 5

Click the Pencil icon to edit the GCS path.

Step 6

Paste in the 2015 taxi rides dataset CSV from Google Cloud Storage:

gs://cloud-training/gcpml/c4/tlc_yellow_trips_2015_small.csv

Click Go.

Step 7

Before selecting Import, click the Pencil Icon to edit the GCS path a second time and paste in the 2016 CSV below:

gs://cloud-training/gcpml/c4/tlc_yellow_trips_2016_small.csv

Click Go.

Step 8

Click Import & Add to Flow.

step_8.png

Step 9

Wait for the datasets to be loaded into Dataprep.

The tool loads up to a 10 MB sample of the underlying data; it connects to and ingests the full data source when the flow is run.

Step 10

Click on the tlc_yellow_trips_2015_small icon and select Add > Recipe.

Add_Recipe.png

Step 11

Click Edit Recipe.

Wait for Dataprep to load your data sample into the explorer view.

Step 12

In the explorer view, find the trip_distance column and examine the histogram.

True or false: were the majority of the cab rides for 2015 less than 2 miles?

step_12.png

True. In the sample, around 57% of rides were between 0 and 2 miles.

Next, combine the 2015 and 2016 datasets.

Step 13

In the navigation bar, find the icon for Union and select it.

cc1a658aa4b1a32e.png

Step 14

In the Union Page, click Add data.

In the popup window, select tlc_yellow_trips_2016_small and click Apply.

Step 15

Confirm the union looks like below (UNION DATA (2)) and then click Add to Recipe.

step_15.png

Wait for Dataprep to Apply the Union.

Now you have a single table with 2016 and 2015 taxicab data.
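Conceptually, the Union step stacks the rows of both datasets by matching column names. A minimal pandas sketch of the same operation (the tiny DataFrames below are hypothetical stand-ins for the taxi CSVs, not the real data):

```python
import pandas as pd

# Hypothetical mini-samples standing in for the 2015 and 2016 taxi datasets.
trips_2015 = pd.DataFrame({"pickup_day": ["2015-12-01"], "passenger_count": [2]})
trips_2016 = pd.DataFrame({"pickup_day": ["2016-12-01"], "passenger_count": [1]})

# A union aligns columns by name and appends the rows of one dataset
# after the other, producing a single combined table.
combined = pd.concat([trips_2015, trips_2016], ignore_index=True)
print(len(combined))  # 2 rows, one from each year
```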

Exploring your Data

Step 16

Examine the pickup_time histogram. Which hours had the fewest pickups? The most?

In the sample, the early morning hours (5 - 6am) had the fewest taxicab pickups.

step_16_a.png

The most taxi cab pickups were in the evening hours, with 19:00 (7 PM) having slightly more than the others.

step_16_b.png

Is this unusual? Would you expect NYC taxi cab trips to be clustered around lunch and earlier hours in the day? Let's continue exploring.

Examine the pickup_day histogram. Which months and years of data are in the dataset?

  • Only December 2015 and December 2016

step_16_c.png

Examine the dropoff_day histogram. Is there anything unusual about it when compared to pickup_day? Why are there records for January 2017?

step_16_d.png

Answer: There are a few trips that start in December and end in January (spending New Years in a taxicab!).

Next, concatenate the date and time fields into a single timestamp.

Step 17

In the navigation bar, find Merge columns.

step_17.png

For columns to merge, specify pickup_day and pickup_time.

For the separator, press your spacebar to type a single space.

Name the new column pickup_datetime.

Preview and click Add.

d6a4d0d6a9a348ef.png

Confirm your new field is properly registering now as a datetime datatype (clock icon).

9273009ca2303d8b.png
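The Merge columns step is string concatenation with a separator. A pandas sketch of the equivalent (the sample row is hypothetical):

```python
import pandas as pd

# Hypothetical sample row with separate date and time columns, as in the taxi data.
df = pd.DataFrame({"pickup_day": ["2015-12-31"], "pickup_time": ["19:05:00"]})

# Merge the two columns with a single-space separator, as in the recipe step,
# then parse the combined string as a datetime. (Dataprep infers the datetime
# type automatically, indicated by the clock icon.)
df["pickup_datetime"] = pd.to_datetime(df["pickup_day"] + " " + df["pickup_time"])
```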

Step 18

Next, you want to create derived columns that compute hourly metrics, such as the average fare over the past hour. To do that, you first need a new column that buckets each pickup by hour, and then you can perform the calculation.

Find the Functions list in the navigation bar.

Select Dates and times.

Select DATEFORMAT.

9f634021f028bd3f.png

In the Formula cell, paste the following which will reformat the pickup date and time. Specify the New column name as pickup_hour:

DATEFORMAT(pickup_datetime,"yyyyMMddHHmmss")

step_18.png

Confirm that the new derived column is shown correctly as a Preview column.

step_18_a.png

Click Add.
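The DATEFORMAT pattern yyyyMMddHHmmss corresponds to strftime's "%Y%m%d%H%M%S". A pandas sketch of what this step produces (the timestamp is hypothetical):

```python
import pandas as pd

# Hypothetical pickup timestamp like those produced by the merge step.
df = pd.DataFrame({"pickup_datetime": pd.to_datetime(["2015-12-31 19:05:42"])})

# DATEFORMAT(pickup_datetime, "yyyyMMddHHmmss") reformats the datetime
# into a compact sortable string: year, month, day, hour, minute, second.
df["pickup_hour"] = df["pickup_datetime"].dt.strftime("%Y%m%d%H%M%S")
print(df["pickup_hour"].iloc[0])  # 20151231190542
```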

Step 19

In order to get the field to properly display only the hour of the pickup, you are going to replace the minutes and seconds with a placeholder of 0000.

In the navigation bar, find Replace and then select Between positions.

step_19.png

For Column select Multiple and then pickup_hour.

For Start position, specify 10.

For End position, specify 14.

For Replace with, specify 0000.

step_19_a.png

Click Add.
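A plain-Python sketch of the "Replace between positions" step, assuming positions count characters from 0: in a yyyyMMddHHmmss stamp, positions 10-14 hold the minutes and seconds.

```python
# Hypothetical stamp from the DATEFORMAT step, in yyyyMMddHHmmss form.
stamp = "20151231190542"

# Replace characters at positions 10-14 (the mmss portion) with "0000",
# keeping only year, month, day, and hour.
pickup_hour = stamp[:10] + "0000"
print(pickup_hour)  # 20151231190000
```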

You now have a taxicab hourly pickup column. Next, you will calculate the average fare for each hour and then a rolling average across hours. You will do this with an aggregation and a rolling window average function.

Step 20

In the navigation toolbar, select Functions > Aggregation > AVERAGE.

Formula_Average.png

For Group by, specify pickup_hour.

For Values, specify AVERAGE(fare_amount).

For Type, specify Group by as new column(s).

step_20.png

Click Add.

You now have an average cab fares field.
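The "Group by as new column(s)" option keeps every row and appends the group average, like a window aggregate rather than a collapsing GROUP BY. A pandas sketch with hypothetical fares:

```python
import pandas as pd

# Hypothetical fares across two pickup hours.
df = pd.DataFrame({
    "pickup_hour": ["20151231180000", "20151231180000", "20151231190000"],
    "fare_amount": [10.0, 14.0, 8.0],
})

# AVERAGE(fare_amount) grouped by pickup_hour, kept as a new column:
# every row receives the average fare for its hour.
df["average_fare_amount"] = df.groupby("pickup_hour")["fare_amount"].transform("mean")
print(df["average_fare_amount"].tolist())  # [12.0, 12.0, 8.0]
```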

Step 21

Next, calculate a rolling window of average fares over the past 3 hours.

Step 22

In the navigation toolbar, select Functions > Window > ROLLINGAVERAGE.

Rolling_Avg_Menu.png

For Formula, specify ROLLINGAVERAGE(average_fare_amount, 3, 0). This computes the rolling average of the hourly average fare over the past 3 hours.

For Sort rows by, specify -pickup_hour.

Note: The preceding minus sign "-" before pickup_hour is intentional. Sorting places the most recent taxicab rides first (descending order), so the rolling window operates over a 3-hour period.

For New column name, specify rollingaverage_fare_amount.

step_22.png

Click Add.
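A pandas sketch of this step, assuming ROLLINGAVERAGE(column, 3, 0) averages the current row plus up to three preceding rows in sort order (the hourly fares below are hypothetical):

```python
import pandas as pd

# Hypothetical hourly average fares, already sorted by -pickup_hour
# (most recent hour first), matching the "Sort rows by" setting.
df = pd.DataFrame({
    "pickup_hour": ["20151231210000", "20151231200000", "20151231190000"],
    "average_fare_amount": [12.0, 9.0, 15.0],
})

# Window of the current row plus up to 3 preceding rows; min_periods=1
# allows partial windows at the start, as a rolling average typically does.
df["rollingaverage_fare_amount"] = (
    df["average_fare_amount"].rolling(window=4, min_periods=1).mean()
)
print(df["rollingaverage_fare_amount"].tolist())  # [12.0, 10.5, 12.0]
```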

Step 23

Toggle open the Recipe icon to preview your final transformation steps.

Recipe_List.png

Step 24

Click Run Job.

Step 25

On the Publishing Actions page, under Settings, edit the publishing location by clicking the Edit button.

step_25.png

Choose BigQuery and the taxi_cab_reporting BigQuery dataset as where you want to create the output table.

(Note: if you do not see a taxi_cab_reporting dataset, refer to the start of this lab for instructions on how to create it in BigQuery)

Choose Create a new table.

Name the table tlc_yellow_trips_reporting.

Choose Drop the table every run.

Select Update.

Step 26

Select Run Job.

Step 27

Wait for your Dataprep job to complete and confirm your new table appears in BigQuery. The data should be loaded in 5 to 8 minutes.

Step 28

While your Cloud Dataprep flow starts and manages your Cloud Dataflow job, you can preview your output results by using BigQuery to query this pre-populated table:

#standardSQL
SELECT
  pickup_hour,
  FORMAT("$%.2f",ROUND(average_3hr_rolling_fare,2)) AS avg_recent_fare,
  ROUND(average_trip_distance,2) AS average_trip_distance_miles_by_hour,
  FORMAT("%'d",sum_passenger_count) AS total_passengers_by_hour
FROM
  `cloud-training-demos.demos.nyc_taxi_reporting`
ORDER BY
  pickup_hour DESC;

Step 29

Once your Dataprep job completes successfully, the new table will be created and populated in BigQuery.

Dataprep_Complete.png

BQ_Complete.png

Congratulations!

You have now built a data transformation pipeline using the Cloud Dataprep UI.

For full documentation and additional tutorials, refer to the Cloud Dataprep support page.

End your lab

When you have completed your lab, click End Lab. Qwiklabs removes the resources you’ve used and cleans the account for you.

You will be given an opportunity to rate the lab experience. Select the applicable number of stars, type a comment, and then click Submit.

The number of stars indicates the following:

  • 1 star = Very dissatisfied
  • 2 stars = Dissatisfied
  • 3 stars = Neutral
  • 4 stars = Satisfied
  • 5 stars = Very satisfied

You can close the dialog box if you don't want to provide feedback.

For feedback, suggestions, or corrections, please use the Support tab.

©2020 Google LLC All rights reserved. Google and the Google logo are trademarks of Google LLC. All other company and product names may be trademarks of the respective companies with which they are associated.